Read in data and understand it

The first step of data analysis is to read our dataset into selected programing language and have an overall understanding of it.

# read in data, change file name or location for different
# cases
mydata <- read.csv("C:/Users/dayif/Desktop/lets do it/New Projects/Used Auto Ads Data - Ebay .csv", 
    header = TRUE)
# use dim to check how large is our data
dim(mydata)
## [1] 371539     20
# use sty to get further information
str(mydata)
## 'data.frame':    371539 obs. of  20 variables:
##  $ dateCrawled        : Factor w/ 15623 levels "3","3/10/16 0:24",..: 6976 6954 2011 3653 10618 14988 13545 5651 15183 3500 ...
##  $ name               : Factor w/ 233527 levels "'Showcar_und_Messefahrzeug'_Opel_Astra_G_Cabrio",..: 80277 4569 92315 82400 172986 28933 147772 215523 64574 217953 ...
##  $ seller             : Factor w/ 4 levels "","90","gewerblich",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ offerType          : Factor w/ 4 levels "","Angebot","Gesuch",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ price              : int  480 18300 9800 1500 3600 650 2200 0 14500 999 ...
##  $ abtest             : Factor w/ 4 levels "","4","control",..: 4 4 4 4 4 4 4 4 3 4 ...
##  $ vehicleType        : Factor w/ 10 levels "","andere","benzin",..: 1 6 10 7 7 9 5 9 4 7 ...
##  $ yearOfRegistration : Factor w/ 157 levels "","1000","1001",..: 90 108 101 98 105 92 101 77 111 95 ...
##  $ gearbox            : Factor w/ 3 levels "","automatik",..: 3 3 2 3 3 3 3 3 3 3 ...
##  $ powerPS            : Factor w/ 796 levels "","0","1","10",..: 2 248 187 722 704 19 40 585 91 13 ...
##  $ model              : Factor w/ 253 levels "","0","1_reihe",..: 121 1 122 121 106 14 10 43 59 121 ...
##  $ kilometer          : int  150000 125000 125000 150000 90000 150000 150000 40000 30000 150000 ...
##  $ monthOfRegistration: Factor w/ 15 levels "","0","1","10",..: 2 11 14 12 13 4 14 13 14 2 ...
##  $ fuelType           : Factor w/ 8 levels "","andere","benzin",..: 3 5 5 3 5 3 3 3 3 1 ...
##  $ brand              : Factor w/ 41 levels "","alfa_romeo",..: 40 3 16 40 33 4 27 40 12 40 ...
##  $ notRepairedDamage  : Factor w/ 3 levels "","ja","nein": 1 2 1 3 3 2 3 3 1 1 ...
##  $ dateCreated        : Factor w/ 115 levels "","1/10/16 0:00",..: 88 88 76 79 96 106 103 85 106 79 ...
##  $ nrOfPictures       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ postalCode         : int  70435 66954 90480 91074 60437 33775 67112 19348 94505 27472 ...
##  $ lastSeen           : Factor w/ 18706 levels "","3/10/16 0:14",..: 18646 18574 17994 4257 18328 18427 18069 9044 17698 12725 ...

The dim() command returns number of rows and columns of our dataset. It contains 371539 different pieces of information and 20 variables, fairly large. The str() command helps us get a basic understanding of the variables: type, number, and several examples.

Eliminating useless variables

In real world data analysis, a data set sometimes includes redundant variables that are not contributing to our analysis. In “Ebay used auto ads” data set, variable “name”, “seller”, “offerType”, “abtest”, and “nrofPictures” are either uninfluencial or ambiguous. Therefore, we throw them out from analysis

# throw out unnecessary variables, mostly determined by
# general knowledge and real-world experience
myvars <- names(mydata) %in% c("name", "seller", "offerType", 
    "abtest", "nrOfPictures")
mydata = mydata[!myvars]

Now, we have 15 contributing variables to work on. However, we have to be carful when eliminating variables, since some of them may have potential effects on our models.

Clean dataset according to date values

I would like to start cleaning our data set from the strictest standards, which are the unbreakable nature rules. In our case, I remove all the rows do not have monthofregistration value in [1,12] range. What is more, I also remove all rows with yearofRegistration value greater than 2017 or less than 1960 (not sure if doing this is correct).

library(ggplot2)
# clean the data from the strictest standard, such as the
# month of registration can not be any value except 1-12...
# convert factor type to numeric type
mydata$monthOfRegistration = as.numeric(as.character(mydata$monthOfRegistration))
## Warning: NAs introduced by coercion
mydata = subset(mydata, monthOfRegistration >= 1 & monthOfRegistration <= 
    12)
summary(mydata$monthOfRegistration)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   4.000   6.000   6.382   9.000  12.000
# year of registration, use boxplot to analyze variable
# distribution, it is reasonable to eliminate any value less
# than 1960 and greater than 2017
mydata$yearOfRegistration = as.numeric(as.character(mydata$yearOfRegistration))
summary(mydata$yearOfRegistration)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    1999    2004    2004    2008    9999
ggplot(aes(x = vehicleType, y = yearOfRegistration), data = mydata) + 
    geom_boxplot() + ylim(1975, 2017) + labs(x = "vehicle type", 
    y = "year of registration") + ggtitle("year of registration vs vehicle type boxplot")
## Warning: Removed 5460 rows containing non-finite values (stat_boxplot).

mydata = subset(mydata, yearOfRegistration >= 1960 & yearOfRegistration <= 
    2017)
summary(mydata$yearOfRegistration)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1960    1999    2004    2003    2008    2017

It is noticeable that we need to convert variable type from factor to numeric before cleaning the data. The boxplot of yearofRegistration range from 1975 to 2017 shows us the 1st and 3rd quantile year range according to different vehicle type (). Based on the boxplot, it is reasonable for us to thorw out year values less than 1960 as outliers (I still not sure whether we should do this or not).

Clean all categorical variables

From last part, we found that there are “blank”s in “vehicleType” variable. This also happens within other variables and may cause problems. Unlike continous variable, categorical variables cannot be imputed through certain methods. I just tag those blank’s as “unknown” or “None” in each variable.

# take a look at main categorical variables and rename blank
# cell to 'None' type
summary(mydata$vehicleType)
##                andere     benzin        bus     cabrio      coupe 
##      19633       2781          0      28730      21576      17551 
## kleinwagen      kombi  limousine        suv 
##      73156      62807      90055      14057
mydata$vehicleType = sub("^$", "None", mydata$vehicleType)
mydata$vehicleType = as.factor(mydata$vehicleType)
summary(mydata$vehicleType)
##     andere        bus     cabrio      coupe kleinwagen      kombi 
##       2781      28730      21576      17551      73156      62807 
##  limousine       None        suv 
##      90055      19633      14057
ggplot(mydata, aes(x = vehicleType)) + geom_bar(fill = "blue", 
    color = "black") + labs(x = "vehicle type", y = "number of cars") + 
    ggtitle("vehicle type Frequency Diagram")

# mydata = mydata[!(is.na(mydata$vehicleType) |
# mydata$vehicleType==''), ] another way to clean na's and
# blank's

# gearbox
mydata$gearbox = sub("^$", "Unknown", mydata$gearbox)
mydata$gearbox = as.factor(mydata$gearbox)
summary(mydata$gearbox)
## automatik   manuell   Unknown 
##     72396    247994      9956
ggplot(mydata, aes(x = gearbox)) + geom_bar(fill = "darkgreen", 
    color = "black") + labs(x = "gearbox", y = "number of cars") + 
    ggtitle("gearbox Frequency Diagram")

# fuelType
mydata$fuelType = sub("^$", "Unknown", mydata$fuelType)
mydata$fuelType = as.factor(mydata$fuelType)
summary(mydata$fuelType)
##  andere  benzin     cng  diesel elektro  hybrid     lpg Unknown 
##     161  203789     534  102277      95     270    4927   18293
ggplot(mydata, aes(x = fuelType)) + geom_bar(fill = "red", color = "black") + 
    labs(x = "fuel type", y = "number of cars") + ggtitle("fuel type Frequency Diagram")

# model
summary(mydata$model)
##        golf      andere         3er                    polo       corsa 
##       26321       23629       18655       13329       11324       10881 
##       astra          a4      passat    c_klasse         5er    e_klasse 
##        9543        9384        9265        8203        7999        7038 
##          a3          a6       focus      fiesta transporter     2_reihe 
##        6111        5603        5398        5068        4932        4510 
##      twingo      fortwo    a_klasse         1er      vectra      touran 
##        4185        4033        3948        3735        3607        3315 
##     3_reihe      mondeo        clio       punto      zafira      megane 
##        3216        3206        3137        2872        2744        2607 
##       ibiza          ka        lupo     x_reihe     octavia      cooper 
##        2441        2302        2276        2241        2109        2058 
##       fabia         clk       micra       caddy      sharan         slk 
##        1983        1712        1534        1478        1426        1324 
##          80        leon      scenic          tt     i_reihe      laguna 
##        1304        1303        1294        1255        1224        1205 
##     6_reihe       omega     1_reihe       civic    m_klasse         7er 
##        1175        1168        1156        1149        1099        1045 
##      galaxy          a5      meriva       yaris    s_klasse    mx_reihe 
##        1035        1002         993         985         942         918 
##         911    b_klasse      tiguan         500         one     z_reihe 
##         911         908         893         873         870         856 
##      kangoo        vito      beetle        bora        colt       arosa 
##         842         841         809         772         771         769 
##    berlingo    sprinter     touareg      escort         v40     transit 
##         743         721         721         718         695         686 
##         fox       tigra    insignia       c_max       swift     corolla 
##         677         675         670         660         636         633 
##          sl       panda          a1    scirocco         v70     4_reihe 
##         628         620         614         596         590         579 
##       grand    seicento     primera     avensis         156     qashqai 
##         579         564         552         545         537         533 
##          a8       stilo         147     (Other) 
##         532         530         524       32539
ggplot(mydata, aes(x = model)) + geom_bar(fill = "yellow", color = "black") + 
    labs(x = "model", y = "number of cars") + ggtitle("model Frequency Diagram")

# brand
summary(mydata$brand)
##                    alfa_romeo           audi            bmw      chevrolet 
##              0           2072          29965          36677           1628 
##       chrysler        citroen          dacia         daewoo       daihatsu 
##           1278           4667            850            487            682 
##           fiat           ford          honda        hyundai         jaguar 
##           8333          22372           2474           3377            579 
##           jeep            kia           lada         lancia     land_rover 
##            715           2379            182            416            717 
##          mazda  mercedes_benz           mini     mitsubishi         nissan 
##           5046          32569           3264           2687           4488 
##           opel        peugeot        porsche        renault          rover 
##          34538           9932           2079          15553            410 
##           saab           seat          skoda          smart sonstige_autos 
##            493           6296           5351           4819           2877 
##         subaru         suzuki         toyota        trabant     volkswagen 
##            665           2088           4391            327          69599 
##          volvo 
##           3024
ggplot(mydata, aes(x = brand)) + geom_bar(fill = "orange", color = "black") + 
    labs(x = "brand", y = "number of cars") + ggtitle("brand Frequency Diagram")

# notRepairedDamage
mydata$notRepairedDamage = sub("^$", "Unknown", mydata$notRepairedDamage)
mydata$notRepairedDamage = sub("Unknown", "No", mydata$notRepairedDamage)
mydata$notRepairedDamage = sub("No", "other", mydata$notRepairedDamage)
mydata$notRepairedDamage = sub("ja", "yes", mydata$notRepairedDamage)
mydata$notRepairedDamage = sub("nein", "Maybe", mydata$notRepairedDamage)
mydata$notRepairedDamage = as.factor(mydata$notRepairedDamage)
# I actually do not know what 'ja' or 'nein' mean, I googled
# them, they are actually 'yes' and 'maybe' in German. So I
# changed 'unknown' to 'other'
summary(mydata$notRepairedDamage)
##  Maybe  other    yes 
## 250924  48544  30878
ggplot(mydata, aes(x = notRepairedDamage)) + geom_bar(fill = "purple", 
    color = "black") + labs(x = "notRepairedDamage", y = "number of cars") + 
    ggtitle("notRepairedDamage Frequency Diagram")

In this section, I checked all our categorical variables and replaced blank cells to either “Unknown” or “No” for different categories. From distribution graphs, we do not notice any unnormal patterns.

Adding age and selling time variables

Since we have the same format of time variables, we can use them to calculate our desired values.

# age is calculated by today's year minus year of
# registration, in number of years
mydata$age = (year(today()) - mydata$yearOfRegistration)
summary(mydata$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   13.00   13.51   18.00   57.00
ggplot(mydata, aes(y = age, x = vehicleType)) + geom_boxplot() + 
    labs(x = "vehicle type", y = "age") + ggtitle("age vs vehicle type boxplot")

# selling time is calculated by lastseen minus datacreated,
# in number of days
mydata$sellingTime <- as.integer(as.Date(mydata$lastSeen) - as.Date(mydata$dateCreated))
summary(mydata$sellingTime)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   6.000   8.891  14.000 300.000
ggplot(mydata, aes(y = sellingTime, x = vehicleType)) + geom_boxplot() + 
    labs(x = "vehicle type", y = "selling time") + ggtitle("selling time vs vehicle type boxplot")

It is reasonable to set age as the difference in years between today and the year of registration. However,

Take care of three continuous variables

From the str() function, we can find that there are three continuous variables in our data set. They are “price”, “powerPS”, and “Kilometer”. It is important to have them in our analysis as response and affecting variables. Let us take a look.

# take a look at our main response variable price and remove
# na's
summary(mydata$price)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.000e+00 1.300e+03 3.300e+03 1.579e+04 7.800e+03 2.147e+09
options(scipen = 5, digits = 4)  # no scientific notation
mydata = mydata[complete.cases(mydata$price), ]
summary(mydata$price)  # there are some outliers, however we keep them and deal with them later
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
##          0       1300       3300      15800       7800 2150000000
ggplot(mydata, aes(x = price)) + geom_bar(fill = "darkgreen", 
    color = "black") + labs(x = "price", y = "number of cars") + 
    ggtitle("price Frequency Diagram")

# convert powerPS from factor to numeric data, and we will
# deal with outliers later
mydata$powerPS = as.numeric(as.character(mydata$powerPS))
mydata = mydata[complete.cases(mydata$powerPS), ]
summary(mydata$powerPS)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      75     110     121     150   20000
ggplot(mydata, aes(x = powerPS)) + geom_histogram(fill = "orange", 
    color = "black", binwidth = 20) + labs(x = "engine power", 
    y = "number of cars") + ggtitle("engine power Frequency Diagram")

# kilometer
summary(mydata$kilometer)  # seems ok to use 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5000  100000  150000  125000  150000  150000
ggplot(mydata, aes(x = kilometer)) + geom_bar(fill = "purple", 
    color = "black") + labs(x = "kilometer", y = "number of cars") + 
    ggtitle("kilometer Frequency Diagram")

Based on the graphs, we can find that both “price” and “powerPS” variables have extreme outliers that affect our univariate analysis. We need to remove those outliers in order to get a good overall picture of our data. For variable “kilometer”, it is interesting to find that more than half of the used cars are assigned to 150,000 kilometer.

# from the graphs of engine power and price, we can find that
# extreme outliers are affecting our analysis of the
# variables. Therefore, we have to remove them.
mydata <- subset(mydata, price < quantile(mydata$price, 0.95))
mydata <- subset(mydata, powerPS < quantile(mydata$powerPS, 0.95))
# it is also reasonable to believe that neither price nor
# engine power should have 0 values.
mydata <- subset(mydata, price > 0)
mydata <- subset(mydata, powerPS > 0)

Now, we re-plot these two graphs.

ggplot(mydata, aes(x = price)) + geom_bar(fill = "darkgreen", 
    color = "black") + labs(x = "price", y = "number of cars") + 
    ggtitle("price Frequency Diagram")

ggplot(mydata, aes(x = powerPS)) + geom_histogram(fill = "orange", 
    color = "black", binwidth = 20) + labs(x = "engine power", 
    y = "number of cars") + ggtitle("engine power Frequency Diagram")

Bivariate analysis

Before doing bivariate analysis, we need to target several important variables and mainly describe relationships between them. We are interested in price, engine power, kilometer, selling time, age, and vehicle type. let us make some plots and interpret them.

ggplot(data = mydata, aes(x = powerPS, y = price)) + geom_point(alpha = 0.02, 
    color = I("red"), position = "jitter") + geom_smooth() + 
    facet_wrap(~vehicleType) + xlab("Engine Power") + ylab("Price") + 
    ggtitle("Engine Power vs. Price")

These plots show clear linear relationship between engine power and price. However, we still need to consider the noise caused by outliers…

ggplot(data = mydata, aes(x = kilometer, y = price)) + geom_point(alpha = 0.02, 
    color = I("red"), position = "jitter") + geom_smooth() + 
    facet_wrap(~vehicleType) + xlab("Kilometer") + ylab("Price") + 
    ggtitle("Kilometer vs. Price")

These plots give us an idea that used cars’ prices are decreasing as their kilometer increase. However, there is an increasing pattern of used car price at the low kilometer range (0 - 50,000). It is hard to explain the exact reason for such pattern, but having fatal car damage on low kilometer used cars (force them to be sold at lower travel distance levels) might be a cause. What is more, outliers shoule also be considered as a distrubing reason here.

ggplot(data = mydata, aes(x = kilometer, y = price)) + geom_point(alpha = 0.02, 
    color = I("green"), position = "jitter") + geom_smooth() + 
    facet_wrap(~notRepairedDamage) + xlab("Kilometer") + ylab("Price") + 
    ggtitle("Kilometer vs. Price")

These graphs only inform us that not repaired damage affect used cars’ prices, but do not explain the sharp increasing price pattern for low kilometer used cars. Let us take a look at kilometer square then.

mydata$kilo2 = (mydata$kilometer^2)
summary(mydata$kilo2)
##        Min.     1st Qu.      Median        Mean     3rd Qu.        Max. 
##    25000000 15600000000 22500000000 17600000000 22500000000 22500000000
ggplot(data = mydata, aes(x = kilo2, y = price)) + geom_point(alpha = 0.02, 
    color = I("green"), position = "jitter") + geom_smooth() + 
    facet_wrap(~notRepairedDamage) + xlab("Kilometer square") + 
    ylab("Price") + ggtitle("Kilometer square vs. Price")

There is no better pattern shown by kilometer square vs price graphs. We do not need to include kilo square into our model.

ggplot(mydata, aes(x = vehicleType, y = price)) + geom_boxplot(aes(fill = vehicleType)) + 
    stat_summary(fun.y = mean, geom = "point", size = 3) + xlab("Vehicle Type") + 
    ylab("Price") + ggtitle("Price vs. Vehicle Type")

This graph exhibits that the suv type used car has a significant higher price than other car types, while andere, kleinwagen, and none (other) types of cars have compartively lower price range. It it reasonable to believe that certain car types have higher prices than other. Therefore, we should include vehicle type as a main factor in our regression model.

ggplot(mydata, aes(x = vehicleType, y = price)) + geom_boxplot(aes(fill = gearbox)) + 
    xlab("Vehicle Type") + ylab("Price") + ggtitle("Price vs. Vehicle Type by gearbox")

summary(mydata$gearbox)
## automatik   manuell   Unknown 
##     46228    218930      3930

This graph tells us that gearbox is also affecting used cars’ prices. It is not hard to understand that auto cars are more expensive than manual ones.

ggplot(data = mydata, aes(x = sellingTime, y = price)) + geom_point(alpha = 0.02, 
    color = I("red"), position = "jitter") + geom_smooth() + 
    facet_wrap(~vehicleType) + xlab("selling time") + ylab("Price") + 
    ggtitle("Selling time vs. Price")

we can not conclue much from these graphs, since selling time outliers are greatly affecting our models. If we only look at data records within 95% quantile, we may find someting useful.

quantile(mydata$sellingTime, 0.95)
## 95% 
##  27
selltime0.95_data = subset(mydata, sellingTime < quantile(mydata$sellingTime, 
    0.95))

ggplot(data = selltime0.95_data, aes(x = sellingTime, y = price)) + 
    geom_point(alpha = 0.02, color = I("red"), position = "jitter") + 
    geom_smooth() + facet_wrap(~vehicleType) + xlab("selling time") + 
    ylab("Price") + ggtitle("Selling time (0.95 quantile) vs. Price")

From these plot, we can find that selling time do not have significant influence on used cars’ prices. However, it is noticable that cars sold within short amount of time usually have lower prices compare to others. We will add selling time into our model and check if their are influencial.

ggplot(data = mydata, aes(x = age, y = price)) + geom_point(alpha = 0.02, 
    color = I("orange"), position = "jitter") + geom_smooth() + 
    facet_wrap(~vehicleType) + xlab("age") + ylab("Price") + 
    ggtitle("car age vs. Price")

It is interesting to find that there are two common patterns shared by these plots (except none type, it actually causing some noise). Used car prices are decreasing within car age range (1-20), and then showing an increasing price pattern for age range (20-40). Since these graphs show such clear patterns between age and price, we should take a deeper look and dig more about the age variable. A quadratic term “age^2” is introduced below.

mydata$age2 = (mydata$age^2)
summary(mydata$age2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      81     196     233     324    3250
ggplot(data = mydata, aes(x = age2, y = price)) + geom_point(alpha = 0.02, 
    color = I("darkgreen"), position = "jitter") + geom_smooth() + 
    facet_wrap(~vehicleType) + xlab("age square") + ylab("Price") + 
    ggtitle("car age square vs. Price")

Age square actually gives us beautiful pattern vs used car price. Let us narrow down the age square and take another look.

quantile(mydata$age2, 0.95)
## 95% 
## 576
age20.95_data = subset(mydata, age2 < quantile(mydata$age2, 0.95))

ggplot(data = age20.95_data, aes(x = age2, y = price)) + geom_point(alpha = 0.02, 
    color = I("darkgreen"), position = "jitter") + geom_smooth() + 
    facet_wrap(~vehicleType) + xlab("age square") + ylab("Price") + 
    ggtitle("car age square (0.95) vs. Price")

From these graphs, we can clearly find the decreasing quadratic pattern between age square and price. We will add age square into our model and keep the outliers for future analysis.

Variable selection

names(mydata)
##  [1] "dateCrawled"         "price"               "vehicleType"        
##  [4] "yearOfRegistration"  "gearbox"             "powerPS"            
##  [7] "model"               "kilometer"           "monthOfRegistration"
## [10] "fuelType"            "brand"               "notRepairedDamage"  
## [13] "dateCreated"         "postalCode"          "lastSeen"           
## [16] "age"                 "sellingTime"         "kilo2"              
## [19] "age2"

Before constructing models, I would like to use most of the variables I have analyzed above. Continuous variables: price, powerPS, Kilometer,selling time, age, and age2. Categorical variables: vehicleType, gearbox, model, fuelType, brand, and notRepairedDamge. Special variable: postalCode. we may introduce more variables as our analysis moving forward.

library(MASS)
summary(mydata$vehicleType)
##     andere        bus     cabrio      coupe kleinwagen      kombi 
##       2131      25239      16540      11529      65763      53034 
##  limousine       None        suv 
##      73080      13003       8769
summary(mydata$gearbox)  # only use 'automatik' and 'mauel' 
## automatik   manuell   Unknown 
##     46228    218930      3930
summary(mydata$model)  # we are not going to use model dummies, since there are too many variables in this category
##        golf      andere         3er        polo       corsa             
##       23255       17796       15421       10124        9584        8880 
##       astra      passat          a4    c_klasse         5er          a3 
##        8580        8229        7943        6716        5470        5094 
##       focus    e_klasse      fiesta     2_reihe transporter      fortwo 
##        4723        4628        4553        4249        3916        3725 
##      twingo          a6    a_klasse         1er      vectra      touran 
##        3531        3512        3430        3273        3158        2962 
##     3_reihe      mondeo        clio       punto      zafira       ibiza 
##        2864        2798        2732        2509        2466        2282 
##      megane        lupo          ka     octavia       fabia      cooper 
##        2230        2022        1942        1893        1861        1815 
##         clk       micra       caddy      sharan     i_reihe          80 
##        1342        1324        1295        1167        1146        1131 
##      scenic        leon     x_reihe     6_reihe      laguna       civic 
##        1115        1093        1091        1059        1039        1038 
##     1_reihe       omega         slk      meriva      galaxy       yaris 
##        1036        1034         984         933         927         909 
##    mx_reihe         one         500    b_klasse      beetle        bora 
##         872         835         827         800         728         727 
##          tt      kangoo        vito        colt       arosa    berlingo 
##         717         713         713         703         680         676 
##         fox      tiguan       c_max       tigra      escort     transit 
##         626         622         620         617         614         608 
##         v40    sprinter       swift          a1     corolla       panda 
##         599         596         587         578         568         555 
##    insignia     4_reihe     z_reihe     avensis    scirocco         v70 
##         541         531         516         515         513         505 
##     qashqai         147         eos         156     primera    seicento 
##         494         492         484         478         478         476 
##       stilo      almera       grand          c3      espace    m_klasse 
##         467         457         442         436         431         424 
##      signum          c5     5_reihe     (Other) 
##         411         407         401       23179
summary(mydata$fuelType)  # we are only about to use 'benzin' and 'diesel'
##  andere  benzin     cng  diesel elektro  hybrid     lpg Unknown 
##      78  170482     481   81056      70     202    3686   13033
mydata$bus <- ifelse(mydata$vehicleType == "bus", 1, 0)
mydata$cabrio <- ifelse(mydata$vehicleType == "cabrio", 1, 0)
mydata$coupe <- ifelse(mydata$vehicleType == "coupe", 1, 0)
mydata$kleinwagen <- ifelse(mydata$vehicleType == "kleinwagen", 
    1, 0)
mydata$kombi <- ifelse(mydata$vehicleType == "kombi", 1, 0)
mydata$limousine <- ifelse(mydata$vehicleType == "limousine", 
    1, 0)
mydata$suv <- ifelse(mydata$vehicleType == "suv", 1, 0)
mydata$automatik <- ifelse(mydata$gearbox == "automatik", 1, 
    0)
mydata$manuell <- ifelse(mydata$gearbox == "manuell", 1, 0)
mydata$benzin <- ifelse(mydata$fuelType == "benzin", 1, 0)
mydata$diesel <- ifelse(mydata$fuelType == "diesel", 1, 0)

In this first set up, we only use several most significant categorical variables and create dummies for them.

names(mydata)
##  [1] "dateCrawled"         "price"               "vehicleType"        
##  [4] "yearOfRegistration"  "gearbox"             "powerPS"            
##  [7] "model"               "kilometer"           "monthOfRegistration"
## [10] "fuelType"            "brand"               "notRepairedDamage"  
## [13] "dateCreated"         "postalCode"          "lastSeen"           
## [16] "age"                 "sellingTime"         "kilo2"              
## [19] "age2"                "bus"                 "cabrio"             
## [22] "coupe"               "kleinwagen"          "kombi"              
## [25] "limousine"           "suv"                 "automatik"          
## [28] "manuell"             "benzin"              "diesel"
set.seed(123)  # randomly set 70% training data set and 30% testing data set
subdata <- sample(nrow(mydata), floor(nrow(mydata) * 0.7))
training <- mydata[subdata, ]
validation <- mydata[-subdata, ]
names(training)
##  [1] "dateCrawled"         "price"               "vehicleType"        
##  [4] "yearOfRegistration"  "gearbox"             "powerPS"            
##  [7] "model"               "kilometer"           "monthOfRegistration"
## [10] "fuelType"            "brand"               "notRepairedDamage"  
## [13] "dateCreated"         "postalCode"          "lastSeen"           
## [16] "age"                 "sellingTime"         "kilo2"              
## [19] "age2"                "bus"                 "cabrio"             
## [22] "coupe"               "kleinwagen"          "kombi"              
## [25] "limousine"           "suv"                 "automatik"          
## [28] "manuell"             "benzin"              "diesel"
fit = lm(price ~ kilometer + age + age2 + powerPS + bus + cabrio + 
    coupe + kleinwagen + kombi + limousine + suv + automatik + 
    manuell + benzin + diesel, data = training)
summary(fit)
## 
## Call:
## lm(formula = price ~ kilometer + age + age2 + powerPS + bus + 
##     cabrio + coupe + kleinwagen + kombi + limousine + suv + automatik + 
##     manuell + benzin + diesel, data = training)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -20038  -1528   -177   1152  21395 
## 
## Coefficients:
##                Estimate  Std. Error t value   Pr(>|t|)    
## (Intercept) 5111.082157   61.011724   83.77    < 2e-16 ***
## kilometer     -0.030368    0.000192 -157.96    < 2e-16 ***
## age         -775.157459    2.939707 -263.69    < 2e-16 ***
## age2          15.457295    0.074919  206.32    < 2e-16 ***
## powerPS       37.800323    0.189989  198.96    < 2e-16 ***
## bus         5619.235445   36.831636  152.57    < 2e-16 ***
## cabrio      6965.620007   40.163726  173.43    < 2e-16 ***
## coupe       5821.499212   43.365438  134.24    < 2e-16 ***
## kleinwagen  5233.422360   33.729914  155.16    < 2e-16 ***
## kombi       4861.568755   34.471222  141.03    < 2e-16 ***
## limousine   5313.209462   33.847225  156.98    < 2e-16 ***
## suv         6862.441150   45.703726  150.15    < 2e-16 ***
## automatik    521.420355   51.082933   10.21    < 2e-16 ***
## manuell      256.257053   48.989763    5.23 0.00000017 ***
## benzin       406.083886   25.530305   15.91    < 2e-16 ***
## diesel      1774.585921   26.981526   65.77    < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2530 on 188345 degrees of freedom
## Multiple R-squared:  0.668,  Adjusted R-squared:  0.668 
## F-statistic: 2.52e+04 on 15 and 188345 DF,  p-value: <2e-16
fit.step = stepAIC(fit)  ## stepwise variable reducation 
## Start:  AIC=2952648
## price ~ kilometer + age + age2 + powerPS + bus + cabrio + coupe + 
##     kleinwagen + kombi + limousine + suv + automatik + manuell + 
##     benzin + diesel
## 
##              Df    Sum of Sq           RSS     AIC
## <none>                       1209729859815 2952648
## - manuell     1    175741884 1209905601699 2952673
## - automatik   1    669204146 1210399063961 2952750
## - benzin      1   1625001720 1211354861535 2952899
## - diesel      1  27784037301 1237513897116 2956923
## - coupe       1 115748805214 1325478665029 2969857
## - kombi       1 127753792196 1337483652011 2971556
## - suv         1 144806600755 1354536460570 2973942
## - bus         1 149501912297 1359231772113 2974594
## - kleinwagen  1 154623470813 1364353330628 2975302
## - limousine   1 158271249860 1368001109675 2975805
## - kilometer   1 160262193055 1369992052871 2976079
## - cabrio      1 193190549017 1402920408832 2980553
## - powerPS     1 254253967275 1463983827090 2988578
## - age2        1 273412024653 1483141884468 2991027
## - age         1 446587062511 1656316922327 3011829
summary(fit.step)
## 
## Call:
## lm(formula = price ~ kilometer + age + age2 + powerPS + bus + 
##     cabrio + coupe + kleinwagen + kombi + limousine + suv + automatik + 
##     manuell + benzin + diesel, data = training)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -20038  -1528   -177   1152  21395 
## 
## Coefficients:
##                Estimate  Std. Error t value   Pr(>|t|)    
## (Intercept) 5111.082157   61.011724   83.77    < 2e-16 ***
## kilometer     -0.030368    0.000192 -157.96    < 2e-16 ***
## age         -775.157459    2.939707 -263.69    < 2e-16 ***
## age2          15.457295    0.074919  206.32    < 2e-16 ***
## powerPS       37.800323    0.189989  198.96    < 2e-16 ***
## bus         5619.235445   36.831636  152.57    < 2e-16 ***
## cabrio      6965.620007   40.163726  173.43    < 2e-16 ***
## coupe       5821.499212   43.365438  134.24    < 2e-16 ***
## kleinwagen  5233.422360   33.729914  155.16    < 2e-16 ***
## kombi       4861.568755   34.471222  141.03    < 2e-16 ***
## limousine   5313.209462   33.847225  156.98    < 2e-16 ***
## suv         6862.441150   45.703726  150.15    < 2e-16 ***
## automatik    521.420355   51.082933   10.21    < 2e-16 ***
## manuell      256.257053   48.989763    5.23 0.00000017 ***
## benzin       406.083886   25.530305   15.91    < 2e-16 ***
## diesel      1774.585921   26.981526   65.77    < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2530 on 188345 degrees of freedom
## Multiple R-squared:  0.668,  Adjusted R-squared:  0.668 
## F-statistic: 2.52e+04 on 15 and 188345 DF,  p-value: <2e-16
library(car)

vif(fit)
##  kilometer        age       age2    powerPS        bus     cabrio 
##      1.503     11.151      9.010      1.814      3.375      2.734 
##      coupe kleinwagen      kombi  limousine        suv  automatik 
##      2.265      6.172      5.502      6.651      1.920     10.907 
##    manuell     benzin     diesel 
##     10.689      4.436      4.493
## new model remove several variables with high correlation
## rates
newfit = lm(price ~ kilometer + age2 + powerPS + bus + cabrio + 
    coupe + suv + automatik + benzin + diesel, data = training)
summary(newfit)
## 
## Call:
## lm(formula = price ~ kilometer + age2 + powerPS + bus + cabrio + 
##     coupe + suv + automatik + benzin + diesel, data = training)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -17568  -1783   -215   1300  23956 
## 
## Coefficients:
##                Estimate  Std. Error t value Pr(>|t|)    
## (Intercept) 6581.510118   40.982347   160.6   <2e-16 ***
## kilometer     -0.056663    0.000193  -293.3   <2e-16 ***
## age2          -2.826215    0.031314   -90.2   <2e-16 ***
## powerPS       41.470491    0.190304   217.9   <2e-16 ***
## bus          634.510547   24.328802    26.1   <2e-16 ***
## cabrio      1708.879300   29.365214    58.2   <2e-16 ***
## coupe        767.872731   35.020304    21.9   <2e-16 ***
## suv         2000.926449   39.601167    50.5   <2e-16 ***
## automatik    177.956343   19.344263     9.2   <2e-16 ***
## benzin       354.888429   28.449853    12.5   <2e-16 ***
## diesel      2488.721357   30.077157    82.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2970 on 188350 degrees of freedom
## Multiple R-squared:  0.542,  Adjusted R-squared:  0.542 
## F-statistic: 2.23e+04 on 10 and 188350 DF,  p-value: <2e-16
vif(newfit)  ## less multicollinearity
## kilometer      age2   powerPS       bus    cabrio     coupe       suv 
##     1.103     1.144     1.323     1.070     1.062     1.073     1.048 
## automatik    benzin    diesel 
##     1.137     4.003     4.057
training$predict <- predict(newfit)
training$error <- residuals((newfit))

validation$predict <- predict(newfit, newdata = validation)
validation$error <- validation$predict - validation$price

hist(training$error)

hist(validation$error)

ggplot(data = validation, aes(x = price, y = predict)) + geom_point(alpha = 0.2) + 
    geom_smooth() + facet_wrap(~vehicleType) + xlab("price") + 
    ylab("predicted") + ggtitle("price vs predicted price")

ggplot(data = validation, aes(x = price, y = error)) + geom_point(alpha = 0.2) + 
    geom_smooth() + facet_wrap(~vehicleType) + xlab("price") + 
    ylab("error") + ggtitle("price vs predicted price error")